Table of Contents ¶
- Introduction
- Data Inspection
- Data Loading
- Check for Missing Values
- Rename Columns
- View Data Information
- Show Information of All Categorical Features
- Categorical Digitization
- Data Analysis
- Descriptive Analysis
- Employee Turnover Ratio
- Relationship Between Employee Satisfaction and Years Spent at Company
- Comparison of Employee Turnover by Department
- Employee Salary Levels by Department
- Impact of Employee Salary on Turnover Rate
- Comparison of Employee Promotions in Last 5 Years and Turnover
- Comparison of Employee Performance Evaluation and Turnover
- Modeling
- Correlation Analysis
- Prepare Training and Testing Sets
- RandomForestClassifier
- XGBoost
- LogisticRegression
- Summary
Introduction¶
The Human Resources Analytics project uses a simulated dataset to understand why the best and most experienced employees are leaving the company.
By analyzing factors like salary, department, and employee satisfaction, this project aims to identify patterns contributing to employee attrition. The insights gained will help HR and management develop effective strategies to improve employee retention and foster a more productive work environment.
Data Source: https://www.kaggle.com/datasets/colara/human-resource/data
Purpose
Data Analysis:
Descriptive Analysis: To provide an overview of employee turnover metrics such as turnover ratio, departmental comparisons, and trends.
Relationship Exploration: To investigate correlations between employee satisfaction, tenure, and turnover, identifying key factors influencing attrition.
Salary and Promotion Impact: To assess the impact of salary levels and promotion opportunities on employee turnover.
Modeling:
Correlation Analysis: To identify significant predictors of employee turnover using statistical methods, guiding feature selection for predictive models.
Predictive Modeling: To develop RandomForestClassifier, XGBoost, and LogisticRegression models to predict employee turnover. Evaluate models based on accuracy and AUC to inform decision-making and intervention strategies.
Data Inspection¶
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool,FactorRange
from bokeh.io import output_notebook
import seaborn as sns
from bokeh.transform import dodge
from bokeh.palettes import Category20
from sklearn.model_selection import train_test_split,cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc,roc_auc_score
from sklearn.preprocessing import LabelEncoder
from bokeh.resources import settings
settings.resources = 'inline'
Data Loading¶
# data loading
data = pd.read_csv("HR_comma_sep.csv")
data.head()
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
Check for Missing Values¶
data.isnull().sum()
satisfaction_level       0
last_evaluation          0
number_project           0
average_montly_hours     0
time_spend_company       0
Work_accident            0
left                     0
promotion_last_5years    0
sales                    0
salary                   0
dtype: int64
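There are no gaps in this dataset, but for datasets that do have them, a common recipe is median-fill for numeric columns and mode-fill for categoricals. A minimal sketch on a toy frame (not this dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps, for illustration only
toy = pd.DataFrame({"satisfaction_level": [0.38, np.nan, 0.72],
                    "salary": ["low", "medium", None]})

# Numeric gap -> column median; categorical gap -> most frequent value
toy["satisfaction_level"] = toy["satisfaction_level"].fillna(toy["satisfaction_level"].median())
toy["salary"] = toy["salary"].fillna(toy["salary"].mode()[0])

assert toy.isnull().sum().sum() == 0
```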
Rename Columns¶
# Renaming the 'sales' column to 'department'
data = data.rename(columns={"sales": "department"})
data.columns
Index(['satisfaction_level', 'last_evaluation', 'number_project',
'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
'promotion_last_5years', 'department', 'salary'],
dtype='object')
View Data Information¶
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64
 3   average_montly_hours   14999 non-null  int64
 4   time_spend_company     14999 non-null  int64
 5   Work_accident          14999 non-null  int64
 6   left                   14999 non-null  int64
 7   promotion_last_5years  14999 non-null  int64
 8   department             14999 non-null  object
 9   salary                 14999 non-null  object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
Show Information of All Categorical Features¶
data.describe(include=['O'])
| | department | salary |
|---|---|---|
| count | 14999 | 14999 |
| unique | 10 | 3 |
| top | sales | low |
| freq | 4140 | 7316 |
Categorical Digitization¶
# Convert 'salary' and 'department' to categorical data type
data['salary'] = data['salary'].astype('category')
data['department'] = data['department'].astype('category')
# Mapping dictionaries from category codes back to labels (used for the plots below)
salary_dict = dict(enumerate(data['salary'].cat.categories))
department_dict = dict(enumerate(data['department'].cat.categories))
# Create new columns for numerical values
data['salary_num'] = data['salary'].cat.codes
data['department_num'] = data['department'].cat.codes
data.head()
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | department | salary | salary_num | department_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low | 1 | 7 |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium | 2 | 7 |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium | 2 | 7 |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low | 1 | 7 |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low | 1 | 7 |
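LabelEncoder is imported above but never used; for reference, it would produce the same integer codes as `cat.codes`, since both order categories alphabetically. A quick sanity check on a toy series:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["low", "medium", "low", "high"], dtype="category")
le = LabelEncoder()
codes_le = le.fit_transform(s)       # classes_ sorted alphabetically: high=0, low=1, medium=2
codes_cat = s.cat.codes.to_numpy()   # pandas assigns the same alphabetical codes

assert (codes_le == codes_cat).all()
```

This also explains the codes seen above: with alphabetical ordering, `low` maps to 1 and `medium` to 2.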
Data Analysis¶
Descriptive Analysis¶
# Exclude categorical columns for mean calculations
columns_to_exclude = ['department', 'salary']
numeric_df = data.drop(columns=columns_to_exclude)
# Overall mean of each numeric column
numeric_df.mean()
satisfaction_level         0.612834
last_evaluation            0.716102
number_project             3.803054
average_montly_hours     201.050337
time_spend_company         3.498233
Work_accident              0.144610
left                       0.238083
promotion_last_5years      0.021268
salary_num                 1.347290
department_num             5.870525
dtype: float64
numeric_df.describe()
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | salary_num | department_num |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 | 1.347290 | 5.870525 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 | 0.625819 | 2.868786 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 4.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 8.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 9.000000 |
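The table above averages over all employees; to contrast leavers with stayers directly, group by `left` before averaging. A toy sketch of the idiom (in the notebook, `numeric_df.groupby('left').mean()` would do the same):

```python
import pandas as pd

toy = pd.DataFrame({"left": [0, 0, 1, 1],
                    "satisfaction_level": [0.8, 0.7, 0.3, 0.4]})
by_left = toy.groupby("left")["satisfaction_level"].mean()

# In this toy data, leavers report lower average satisfaction
assert by_left[1] < by_left[0]
```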
Employee Turnover Ratio¶
# Count the number of employees who stayed and left
left_count = numeric_df['left'].value_counts().reset_index(name="count")
# Plotting a pie chart to show the turnover ratio
labels = ['Stayed', 'Left']
sizes = left_count['count']
colors = ['#1f77b4', '#ff7f0e'] # Blue and orange colors
explode = (0, 0.1) # Explode the 'Left' slice
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, colors=colors, explode=explode,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.title("Employee Turnover Ratio")
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
Relationship Between Employee Satisfaction and Years Spent at Company¶
# Calculate average satisfaction level based on years spent at the company
time_mean_satisfaction = numeric_df.groupby('time_spend_company')['satisfaction_level'].mean().reset_index(name="average_satisfaction")
# Plotting a bar chart to show satisfaction vs. years spent at the company
plt.figure(figsize=(10, 6))
plt.bar(time_mean_satisfaction['time_spend_company'], time_mean_satisfaction['average_satisfaction'], color='#1f77b4')
plt.title("Employee Satisfaction vs. Years Spent at Company")
plt.xlabel("Years at Company")
plt.ylabel("Average Satisfaction")
plt.grid(True)
plt.show()
Comparison of Employee Turnover by Department¶
# Convert the crosstab to a format suitable for Bokeh
depart_left_table = pd.crosstab(index=numeric_df['department_num'], columns=numeric_df['left'])
depart_left_table.reset_index(inplace=True)
# Map numerical department values to categorical names
departments = [department_dict[dept] for dept in depart_left_table['department_num']]
stayed = depart_left_table[0].tolist()
left = depart_left_table[1].tolist()
data_left = {'departments': departments,
'Stayed': stayed,
'Left': left}
source = ColumnDataSource(data=data_left)
# Create the figure
p = figure(y_range=FactorRange(*departments), height=400, width=800,
title="Employee Turnover by Department", toolbar_location=None, tools="")
# Add bars
p.hbar(y=dodge('departments', -0.15, range=p.y_range), right='Stayed', height=0.3, source=source,
color=Category20[20][0], legend_label="Stayed")
p.hbar(y=dodge('departments', 0.15, range=p.y_range), right='Left', height=0.3, source=source,
color=Category20[20][1], legend_label="Left")
# Customize the plot
p.y_range.range_padding = 0.1
p.ygrid.grid_line_color = None
p.xaxis.axis_label = "Count"
p.yaxis.axis_label = "Department"
p.legend.orientation = "horizontal"
p.legend.location = "top_center"
# Output to notebook
output_notebook()
show(p)
Summary: The sales department has the highest number of employees leaving, with 1,014 people. This is followed by the technical department, with 697 people leaving.
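Raw counts favour large departments such as sales; the turnover *rate* (the mean of `left` per department) is often the fairer comparison. A toy sketch of the pattern:

```python
import pandas as pd

toy = pd.DataFrame({"department": ["sales", "sales", "sales", "sales", "IT", "IT"],
                    "left":       [1, 1, 0, 0, 0, 0]})
# Since `left` is 0/1, the group mean is the departmental turnover rate
rate = toy.groupby("department")["left"].mean()

assert rate["sales"] == 0.5 and rate["IT"] == 0.0
```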
Employee Salary Levels by Department¶
# Crosstab to compare salary distribution by department
depart_salary_table = pd.crosstab(index=data['department'], columns=data['salary'])
# Ensure all salary levels are present for each department
depart_salary_table = depart_salary_table.reindex(columns=salary_dict.values(), fill_value=0)
# Convert the Crosstab data to Bokeh-compatible format
departments = depart_salary_table.index.tolist()
# Categories are ordered alphabetically (high=0, low=1, medium=2), so index the columns by name
low_salary = depart_salary_table['low'].tolist()
medium_salary = depart_salary_table['medium'].tolist()
high_salary = depart_salary_table['high'].tolist()
source = ColumnDataSource(data={
'departments': departments,
'Low Salary': low_salary,
'Medium Salary': medium_salary,
'High Salary': high_salary
})
# Create the figure
p = figure(x_range=departments, height=400, width=800,
title="Employee Salary Levels by Department", toolbar_location=None, tools="")
# Add vertical bars with hover tooltips
p.vbar_stack(stackers=['Low Salary', 'Medium Salary', 'High Salary'], x='departments', width=0.9, color=['#1f77b4', '#ff7f0e', '#2ca02c'],
source=source, legend_label=['Low Salary', 'Medium Salary', 'High Salary'])
# Hover tooltip configuration
hover = HoverTool()
hover.tooltips = [
('Department', '@departments'),
('Low Salary', '@{Low Salary}'),
('Medium Salary', '@{Medium Salary}'),
('High Salary', '@{High Salary}')
]
p.add_tools(hover)
# Customize the plot
p.y_range.start = 0
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_right"
p.legend.orientation = "vertical"
# Show plot
output_notebook()
show(p)
Summary: In the sales department, the number of employees with low salary is the highest, totaling 269 people. This is followed by the technical department with 201 people and the support department with 141 people.
Impact of Employee Salary on Turnover Rate¶
# Crosstab to compare turnover by salary level
salary_left_table = pd.crosstab(index=data['salary'], columns=data['left'])
# Plotting bar chart to show turnover rate by salary level
fig, ax = plt.subplots(figsize=(6, 5))
bars = salary_left_table.plot(kind='bar', stacked=False, ax=ax, color=['#1f77b4', '#ff7f0e'])
# Setting legend labels
ax.legend(['Stayed', 'Left'])
ax.set_title("Impact of Employee Salary on Turnover")
ax.set_xlabel("Salary")
ax.set_ylabel("")
plt.xticks(rotation=0)
# Adding count labels on top of each bar
for bar in bars.patches:
    # Annotate each bar with its height (the employee count)
    height = bar.get_height()
    ax.annotate(f'{height:.0f}', xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                ha='center', va='bottom')
plt.show()
Summary: The turnover rate for employees with low and medium salaries is relatively high, at 42% and 26% respectively, while the turnover rate for employees with high salaries is only 7%.
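The percentages quoted above come from row-normalizing the crosstab, which `pd.crosstab(..., normalize='index')` does directly. A sketch with toy data:

```python
import pandas as pd

toy = pd.DataFrame({"salary": ["low"] * 5 + ["high"] * 5,
                    "left":   [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]})
# normalize='index' turns each row of counts into proportions
rates = pd.crosstab(toy["salary"], toy["left"], normalize="index")

assert rates.loc["low", 1] == 0.4    # 2 of 5 low-salary employees left
assert rates.loc["high", 1] == 0.2   # 1 of 5 high-salary employees left
```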
Comparison of Employee Promotions in Last 5 Years and Turnover¶
# Crosstab to compare turnover by promotion status
promotion_left_table = pd.crosstab(index=data['promotion_last_5years'], columns=data['left'])
promotion_dict = {0: "No Promotion", 1: "Promotion"}
# Plotting bar chart to show turnover rate by promotion status
fig, ax = plt.subplots(figsize=(6, 6))
promotion_left_table.plot(kind='bar', stacked=False, ax=ax, color=['#1f77b4', '#ff7f0e'])
ax.legend(['Stay', 'Left'])
ax.set_title("Impact of Employee Promotions in Last 5 Years on Turnover")
ax.set_xlabel("Promotion Status")
ax.set_ylabel("Count")
ax.set_xticklabels([promotion_dict[val] for val in promotion_left_table.index], rotation=0)
plt.show()
Summary: Employees who have not been promoted in the past 5 years leave at a much higher rate than those who have; 94% of the employees who were promoted are still employed.
Comparison of Employee Performance Evaluation and Turnover¶
evaluation_left_table = pd.crosstab(index=data['last_evaluation'], columns=data['left'])
# Plotting bar chart to show turnover rate by performance evaluation
fig, ax = plt.subplots(figsize=(10, 5))
evaluation_left_table.plot(kind='bar', stacked=False, ax=ax, color=['#1f77b4', '#ff7f0e'])
ax.legend(['Stay', 'Left'])
ax.set_title("Impact of Employee Performance Evaluation on Turnover")
ax.set_xlabel("Performance Evaluation")
ax.set_ylabel("Count")
plt.xticks(rotation=90)
plt.show()
Summary: Many of the employees who left the company had received high performance evaluations.
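Because `last_evaluation` takes many distinct values, the bar chart above is crowded; binning the scores with `pd.cut` before the crosstab gives a cleaner view. A sketch (the bin edges here are an illustrative assumption, not from the notebook):

```python
import pandas as pd

scores = pd.Series([0.36, 0.55, 0.72, 0.88, 0.99])
# Intervals are right-closed by default: (0.3, 0.5], (0.5, 0.7], ...
binned = pd.cut(scores, bins=[0.3, 0.5, 0.7, 0.9, 1.0],
                labels=["0.3-0.5", "0.5-0.7", "0.7-0.9", "0.9-1.0"])

assert list(binned) == ["0.3-0.5", "0.5-0.7", "0.7-0.9", "0.7-0.9", "0.9-1.0"]
```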
Modeling¶
Correlation Analysis¶
corr = numeric_df.corr()
# Mask the lower triangle (including the diagonal) so each pair is shown once
mask = np.zeros_like(corr)
mask[np.tril_indices_from(mask)] = True
with sns.axes_style("white"):
    sns.set(rc={'figure.figsize': (7, 7)})
    ax = sns.heatmap(corr,
                     xticklabels=True, yticklabels=True,
                     cmap='RdBu',
                     mask=mask,
                     fmt='.3f',
                     annot=True,
                     linewidths=.5,
                     vmax=.4,
                     square=True
                     # center=0
                     )
plt.title("Correlation")
label_x = ax.get_xticklabels()
plt.setp(label_x, rotation=45, horizontalalignment='right')
plt.show()
Summary:
Satisfaction Level and Time Spent at the Company appear to have the strongest correlations with Employee Turnover (left).
Number of Projects and Average Monthly Hours also show moderate correlations with each other and with Last Evaluation, suggesting potential workload implications.
Work Accident, Promotion in Last 5 Years, Salary Level, and Department show minimal correlations with other variables in the dataset.
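The heatmap can also be summarized numerically by ranking absolute correlations with `left`; in the notebook this would be `numeric_df.corr()['left']`. A toy sketch of the idiom:

```python
import pandas as pd

toy = pd.DataFrame({"left":               [0, 0, 1, 1, 0, 1],
                    "satisfaction_level": [0.9, 0.8, 0.2, 0.3, 0.7, 0.1],
                    "Work_accident":      [0, 1, 0, 0, 1, 0]})
# Drop the self-correlation, then rank predictors by absolute correlation
ranked = toy.corr()["left"].drop("left").abs().sort_values(ascending=False)

assert ranked.index[0] == "satisfaction_level"
```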
Prepare Training and Testing Sets¶
# Selecting relevant features and target variable
features = ['satisfaction_level', 'last_evaluation', 'number_project',
'average_montly_hours', 'time_spend_company', 'Work_accident',
'promotion_last_5years']
target = 'left'
X = data[features]
y = data[target]
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
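About 24% of employees left, so the classes are imbalanced; passing `stratify=y` keeps that ratio identical in both splits (the split above does not, which is usually harmless at this sample size but worth knowing). A sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)      # 20% positive, toy stand-in for `left`
X = np.arange(100).reshape(-1, 1)
# stratify=y preserves the 80/20 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)

assert y_tr.mean() == 0.2 and y_te.mean() == 0.2
```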
RandomForestClassifier¶
# Creating the RandomForestClassifier
rdf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=90)
rdf_score = cross_val_score(rdf, X_train, y_train, cv=cv).mean()
# Training the model
rdf.fit(X_train, y_train)
# Predicting probabilities on the test set
y_pred_prob = rdf.predict_proba(X_test)[:, 1]
rdf_y_pred = rdf.predict(X_test)
# Computing ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)
rdf_auc = roc_auc_score(y_test,y_pred_prob)
# Plotting ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
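A fitted random forest also exposes `feature_importances_`, which helps explain where the high AUC comes from; in the notebook, `rdf.feature_importances_` could be paired with the `features` list. A self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
# One importance score per feature; scores sum to 1
importances = model.feature_importances_

assert len(importances) == 5
```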
XGBoost¶
!pip install xgboost
from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=100, random_state=42)
xgb_score = cross_val_score(xgb, X_train, y_train, cv=cv).mean()
xgb.fit(X_train, y_train)
# Predict probabilities for the test set
y_pred_prob = xgb.predict_proba(X_test)[:, 1]
xgb_y_pred = xgb.predict(X_test)
# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
xgb_auc = roc_auc_score(y_test,y_pred_prob)
# Plot the ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='blue', label=f'XGBoost (AUC = {roc_auc_score(y_test, y_pred_prob):.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
LogisticRegression¶
from sklearn.linear_model import LogisticRegression as LR
# Train Logistic Regression model
logisticR = LR(max_iter=1000)
logisticR_score = cross_val_score(logisticR, X_train, y_train, cv=cv).mean()
logisticR.fit(X_train, y_train)
# Predict probabilities for the test set
y_pred_prob = logisticR.predict_proba(X_test)[:, 1]
logisticR_y_pred = logisticR.predict(X_test)
# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
logisticR_auc = roc_auc_score(y_test,y_pred_prob)
# Plot the ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='blue', label=f'Logistic Regression (AUC = {roc_auc_score(y_test, y_pred_prob):.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
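Part of the gap between LogisticRegression and the tree ensembles may come from unscaled inputs (average_montly_hours is two orders of magnitude larger than satisfaction_level); trees are scale-invariant, logistic regression is not. A sketch of a scaling pipeline, offered as a possible refinement rather than the notebook's method:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
# Standardize features before fitting; the pipeline applies both steps on predict too
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

assert 0.5 <= pipe.score(X, y) <= 1.0
```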
Summary¶
# Print the classification_report for each model
from sklearn.metrics import classification_report as CR
print('RandomForest'.center(50), CR(y_test, rdf_y_pred), sep='\n')
print('XGBoost'.center(50), CR(y_test, xgb_y_pred), sep='\n')
print('LogisticRegression'.center(50), CR(y_test, logisticR_y_pred), sep='\n')
RandomForest
precision recall f1-score support
0 0.99 1.00 0.99 3428
1 0.99 0.96 0.98 1072
accuracy 0.99 4500
macro avg 0.99 0.98 0.98 4500
weighted avg 0.99 0.99 0.99 4500
XGBoost
precision recall f1-score support
0 0.99 0.99 0.99 3428
1 0.98 0.95 0.97 1072
accuracy 0.98 4500
macro avg 0.98 0.97 0.98 4500
weighted avg 0.98 0.98 0.98 4500
LogisticRegression
precision recall f1-score support
0 0.79 0.92 0.85 3428
1 0.49 0.24 0.32 1072
accuracy 0.76 4500
macro avg 0.64 0.58 0.59 4500
weighted avg 0.72 0.76 0.73 4500
score = {'Model_score': [rdf_score, xgb_score, logisticR_score],
         'Auc_area': [rdf_auc, xgb_auc, logisticR_auc]}
score_com = pd.DataFrame(data=score, index=['RandomForest', 'XGBoost', 'LogisticRegression'])
score_com.sort_values(by=['Model_score'], ascending=False)
| | Model_score | Auc_area |
|---|---|---|
| RandomForest | 0.990760 | 0.991931 |
| XGBoost | 0.986855 | 0.991493 |
| LogisticRegression | 0.768929 | 0.801972 |
RandomForest demonstrates the best performance in terms of both accuracy and AUC.
XGBoost follows closely behind, making both ensemble methods preferable choices for predicting employee turnover.
LogisticRegression does not perform as well in this particular case and is less suitable for deployment if high predictive accuracy is required.